Resampling Methods

Resampling is the process of repeatedly drawing subsamples from a training dataset, and fitting a model to each sample with the goal of discovering additional properties or information about the model. For example, in a regression modeling context, we can fit a particular regression model to each sample, and observe how the fits vary among the samples.

We will introduce two important resampling methods:

  • cross-validation
  • bootstrapping

Both have important uses in statistical and machine learning applications, particularly for assessing models, performing model selection, and estimating the precision of parameter estimates.

Cross-validation

One approach for evaluating the fit of a particular model is to divide the available dataset into two parts and:

  • use one subset to fit the model
  • use the other subset to test the model

What do we mean by "test"? If the model fit is a good one, then providing new data to the model should generate predicted outputs that are close to the observed outputs. This can be quantified by calculating the test error.
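
For a regression model, a natural summary of the test error is the mean squared error (MSE) over the $m$ held-out observations:

$$\text{MSE}_{\text{test}} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$

where $\hat{y}_i$ is the model's prediction for the $i$th held-out observation; its square root (RMSE) is the criterion we will use below.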


In [ ]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [ ]:
salmon = pd.read_table("../data/salmon.dat", delim_whitespace=True, index_col=0)
plt.scatter(x=salmon.spawners, y=salmon.recruits)

We can fit polynomials of different degrees to these data. At one extreme, a linear fit (degree 1) underfits the relationship; at the other, a polynomial with a very large number of terms (here, degree 15) clearly overfits the data.


In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(14,6))

xvals = np.arange(salmon.spawners.min(), salmon.spawners.max())

fit1 = np.polyfit(salmon.spawners, salmon.recruits, 1)
p1 = np.poly1d(fit1)
axes[0].plot(xvals, p1(xvals))
axes[0].scatter(x=salmon.spawners, y=salmon.recruits)

fit15 = np.polyfit(salmon.spawners, salmon.recruits, 15)
p15 = np.poly1d(fit15)
axes[1].plot(xvals, p15(xvals))
axes[1].scatter(x=salmon.spawners, y=salmon.recruits)

We can select an appropriate polynomial order for the model using cross-validation, in which we hold out a testing subset from our dataset, fit the model to the remaining data, and evaluate its performance on the held-out subset.


In [ ]:
test_set = salmon.sample(frac=0.3)

In [ ]:
train_set = salmon.drop(test_set.index)

A natural criterion to evaluate model performance is root mean square error.


In [ ]:
def rmse(x, y, coefs):
    yfit = np.polyval(coefs, x)
    return np.sqrt(np.mean((y - yfit) ** 2))

We can now evaluate the model at varying polynomial degrees, and compare their fit.


In [ ]:
# suppress warnings from Polyfit
import warnings
warnings.filterwarnings('ignore', message='Polyfit*')

degrees = np.arange(13)
train_err = np.zeros(len(degrees))
validation_err = np.zeros(len(degrees))

for i, d in enumerate(degrees):
    p = np.polyfit(train_set.spawners, train_set.recruits, d)

    train_err[i] = rmse(train_set.spawners, train_set.recruits, p)
    validation_err[i] = rmse(test_set.spawners, test_set.recruits, p)

fig, ax = plt.subplots()

ax.plot(degrees, validation_err, lw=2, label = 'cross-validation error')
ax.plot(degrees, train_err, lw=2, label = 'training error')

ax.legend(loc=0)
ax.set_xlabel('degree of fit')
ax.set_ylabel('rms error')

In the cross-validation above, notice that the validation error is high at both very low and very high polynomial degrees, while the training error declines monotonically with degree. The validation error reflects a tradeoff between two components: bias and variance. When a model is underfit, its bias is high but its variance is low; when a model is overfit, the reverse is true.

One can show that the MSE decomposes into a sum of the bias (squared) and variance of the estimator:

$$\begin{aligned} \text{Var}(\hat{\theta}) &= E[(\hat{\theta} - \theta)^2] - (E[\hat{\theta} - \theta])^2 \\ \Rightarrow E[(\hat{\theta} - \theta)^2] &= \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2 \end{aligned}$$

The training error, on the other hand, does not have this tradeoff; it will always decrease (or at least, never increase) as variables (polynomial terms) are added to the model.

K-fold Cross-validation

In the example above, our validation was based on just one random split of the data. Try re-running the above example on a different random subset, and examine the result.

There are two issues with this:

  1. Your particular random subset may not be representative
  2. The fitted model will have higher variance relative to a model fit to the complete dataset

In k-fold cross-validation, the training set is split into k smaller sets. Then, for each of the k "folds":

  1. train the model on k-1 of the folds
  2. validate the model on the remaining fold, using an appropriate metric

The performance measure reported by k-fold CV is then the average of the k computed values. This approach can be computationally expensive, but does not waste too much data, which is an advantage over having a fixed test subset.
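
This splitting scheme can be sketched with scikit-learn's KFold iterator; the example below is purely illustrative (the exercise that follows asks you to implement the folds yourself).


In [ ]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True)
for train_idx, test_idx in kf.split(salmon):
    train_fold, test_fold = salmon.iloc[train_idx], salmon.iloc[test_idx]
    print(len(train_fold), len(test_fold))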

Example: Cross-validation subsets

Extend the subsetting operation from above into a function, gen_k_folds, that splits a dataset into k random train/test partitions and returns the corresponding lists of training and testing subsets (we will use k = 5 below).


In [ ]:
# Write your answer here

We can now perform k-fold cross-validation, and report the average error over all of the folds.


In [ ]:
import warnings
warnings.filterwarnings('ignore', message='Polyfit*')

k = 5
degrees = np.arange(8)
k_fold_err = np.empty(len(degrees))

for i, d in enumerate(degrees):
    
    error = np.empty(k)
    training, testing = gen_k_folds(salmon, k)
    
    for j, fold in enumerate(zip(training, testing)):

        train, test = fold
        
        # NB: unpacking assumes the first data column is recruits (y) and the second is spawners (x)
        y_train, x_train = train.values.T
        y_test, x_test = test.values.T
        
        p = np.polyfit(x_train, y_train, d)
        
        error[j] = rmse(x_test, y_test, p)

    k_fold_err[i] = error.mean()
        

fig, ax = plt.subplots()

ax.plot(degrees, k_fold_err, lw=2)
ax.set_xlabel('degree of fit')
ax.set_ylabel('average rms error')

Bootstrapping

Parametric inference can be non-robust:

  • inaccurate if parametric assumptions are violated
  • if we rely on asymptotic results, we may not achieve an acceptable level of accuracy

Parametric inference can be difficult:

  • derivation of sampling distribution may not be possible

An alternative is to estimate the sampling distribution of a statistic empirically without making assumptions about the form of the population.

We have seen this already with the kernel density estimate.

Non-parametric Bootstrap

The bootstrap is a resampling method introduced by Bradley Efron that allows one to approximate the sampling distribution of a statistic, and thereby obtain estimates of the mean and variance of that distribution.

Bootstrap sample:

$$S_1^* = \{x_{11}^*, x_{12}^*, \ldots, x_{1n}^*\}$$

$S_i^*$ is a sample of size $n$, drawn with replacement from the original sample.

In Python, we have already seen several ways of sampling. If we want to use NumPy for this, we can permute the column of passenger names to obtain a random sample (without replacement):


In [ ]:
titanic = pd.read_excel("../data/titanic.xls", "titanic")

np.random.permutation(titanic.name)[:5]

Sampling is even easier in pandas; DataFrame and Series objects have sample methods that allow for sampling without the need for outside functions.


In [ ]:
titanic.name.sample(n=5)

For bootstrapping, however, we need to sample with replacement; the pandas sample method supports this via the replace argument.


In [ ]:
titanic.name.sample(n=5, replace=True)
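
For reference, a NumPy alternative is np.random.choice with replace=True:


In [ ]:
np.random.choice(titanic.name.values, size=5, replace=True)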

We regard S as an "estimate" of population P

population : sample :: sample : bootstrap sample

The idea is to generate replicate bootstrap samples:

$$S^* = \{S_1^*, S_2^*, \ldots, S_R^*\}$$

Compute statistic $t$ (estimate) for each bootstrap sample:

$$T_i^* = t(S_i^*)$$

In [ ]:
# Sample size
n = 10
# Bootstrap replicates
R = 1000

# Original sample (n=10)
data = np.random.normal(size=n)

# 1000 bootstrap samples of size 10
s = [data[np.random.randint(0,n,n)].mean() for i in range(R)]
_ = plt.hist(s, bins=30)

Bootstrap Estimates

From our bootstrapped samples, we can extract estimates of the expectation and its variance:

$$\bar{T}^* = \hat{E}(T^*) = \frac{\sum_i T_i^*}{R}$$

$$\hat{\text{Var}}(T^*) = \frac{\sum_i (T_i^* - \bar{T}^*)^2}{R-1}$$

In [ ]:
boot_mean = np.sum(s)/R
boot_mean

In [ ]:
boot_var = ((np.array(s) - boot_mean) ** 2).sum() / (R-1)
boot_var

Since we have estimated the expectation of the bootstrapped statistics, we can estimate the bias of T:

$$\hat{B}^* = \bar{T}^* - T$$

In [ ]:
boot_mean - np.mean(data)

Bootstrap error

There are two sources of error in bootstrap estimates:

  1. Sampling error from the selection of $S$.
  2. Bootstrap error from failing to enumerate all possible bootstrap samples.

For the sake of accuracy, it is prudent to choose at least R = 1000 bootstrap replicates.

Bootstrap Percentile Intervals

An attractive feature of bootstrap statistics is the ease with which you can obtain an estimate of uncertainty for a given statistic. We simply use the empirical quantiles of the bootstrapped statistics to obtain percentiles corresponding to a confidence interval of interest.

This employs the ordered bootstrap replicates:

$$T_{(1)}^*, T_{(2)}^*, \ldots, T_{(R)}^*$$

Simply extract the $100(\alpha/2)$ and $100(1-\alpha/2)$ percentiles:

$$T_{[(R+1)\alpha/2]}^* \lt \theta \lt T_{[(R+1)(1-\alpha/2)]}^*$$

In [ ]:
s_sorted = np.sort(s)
s_sorted[:10]

In [ ]:
s_sorted[-10:]

In [ ]:
alpha = 0.05
s_sorted[[int((R+1)*alpha/2), int((R+1)*(1-alpha/2))]]
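
A nearly equivalent shortcut is NumPy's percentile function, which computes the interval limits directly from the replicates:


In [ ]:
np.percentile(s, [100 * alpha / 2, 100 * (1 - alpha / 2)])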

Exercise: Cervical dystonia bootstrap estimates

Use bootstrapping to estimate the mean of one of the treatment groups, and calculate percentile intervals for the mean.


In [ ]:
# Write your answer here

Missing Data

Missing data are a common problem in most real-world scientific datasets. While the best way to deal with missing data is to prevent them from occurring in the first place, this often cannot be helped, particularly when data are collected passively or voluntarily, or when data collection and recording are distributed among several people. There are a variety of ways to deal with missing data, from the very naïve to the very sophisticated, and unfortunately the more common approaches tend to be ad hoc and will usually do more harm than good.

It turns out that more robust methods for imputation are not as difficult to implement as they first appear to be. Two of the best ones are Bayesian imputation and multiple imputation. In this section, we will use multiple imputation to account for missing data in a regression analysis.

As a motivating example, we will use a dataset of educational outcomes for children with hearing impairment. Here, we are interested in determining factors that are associated with better or poorer learning outcomes. There is a suite of available predictors, including gender (male), number of siblings in the household (siblings), an index of family involvement (family_inv), whether the primary household language is not English (non_english), the presence of a previous disability (prev_disab), non-white race (non_white), the age at the time of testing (in months, age_test), whether hearing loss is not severe (non_severe_hl), whether the subject's mother obtained a high school diploma or better (mother_hs), and whether the hearing impairment was identified by 3 months of age (early_ident).


In [ ]:
test_scores = pd.read_csv('../data/test_scores.csv', index_col=0)
test_scores.head()

For three variables in the dataset, there are incomplete records.


In [ ]:
test_scores.isnull().sum(0)

Strategies for dealing with missing data

The easiest (and worst) way to deal with missing data is to ignore it. That is, simply run the analysis, missing values and all, hoping for the best. If your software is any good, this approach will simply not work; the algorithm will try to operate on data that include missing values and propagate them, resulting in statistics and estimates that cannot be calculated, which will typically raise errors. If your software is poor, it will make some assumption or decision about the missing values and proceed to generate results conditional on that assumption, creating problems that may never be detected because no indication of them is ever given.

The next easiest (worst) approach to analyzing data with missing values is to conduct list-wise deletion, by deleting the records that have missing values. This is called complete case analysis, because only records that are complete get retained for the analysis. The degree to which complete case analysis is undesirable depends on the mechanism by which data have become missing.
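
As a quick check of how much information list-wise deletion discards here, we can compare the shape of the full dataset with that of the complete cases:


In [ ]:
test_scores.shape, test_scores.dropna().shape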

Types of Missingness

  • Missing completely at random (MCAR): When data are MCAR, missing cases are, on average, identical to non-missing cases, with respect to the model. Ignoring the missingness will reduce the power of the analysis, but will not bias inference.
  • Missing at random (MAR): Missing data depends (usually probabilistically) on measured values, and hence can be modeled by variables observed in the data set. Accounting for the values which “cause” the missing data will produce unbiased results in an analysis.
  • Missing not at random (MNAR): Missing data depend on unmeasured or unknown variables. There is no information available to account for the missingness.

The very best-case scenario for using complete case analysis, which corresponds to MCAR missingness, results in a loss of power due to the reduction in sample size. The analysis will lose the information contained in the non-missing elements of a partially-missing record. When data are not missing completely at random, inferences from complete case analysis may be biased due to systematic differences between missing and non-missing records that affect the estimates of key parameters.

One alternative to complete case analysis is to simply fill (impute) the missing values with a reasonable guess at the true value, such as the mean, median, or modal value of the fully-observed records. This imputation, while not recovering any information regarding the missing value itself, does provide a mechanism for including the non-missing values of a record in the analysis, thereby making use of all available information.

Performing imputation via mean substitution is easy in Pandas, via the DataFrame/Series fillna method.


In [ ]:
test_scores.siblings.mean()

In [ ]:
test_scores.siblings.fillna(test_scores.siblings.mean())

This approach may be reasonable under the MCAR assumption, but may induce bias under a MAR scenario, in which missing values differ systematically from non-missing values, making the summary statistic computed from the observed values a biased stand-in for the mean/median/modal value of the missing ones.

Beyond this, the use of a single imputed value to stand in place of the actual missing value glosses over the uncertainty associated with this guess at the true value. Any subsequent analysis procedure (e.g. regression analysis) will behave as if the imputed value were observed, despite the fact that we are actually unsure of the actual value for the missing variable. The practical consequence of this is that the variance of any estimates resulting from the imputed dataset will be artificially reduced.
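
We can see this shrinkage directly with the siblings variable: the standard deviation of the mean-imputed series is smaller than that of the observed values alone.


In [ ]:
test_scores.siblings.std(), test_scores.siblings.fillna(test_scores.siblings.mean()).std()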

Multiple Imputation

One robust alternative to addressing missing data is multiple imputation (Schafer 1999, van Buuren 2012). It produces unbiased parameter estimates, while simultaneously accounting for the uncertainty associated with imputing missing values. It is conceptually and mechanistically straightforward, and produces complete datasets that may be analyzed using any statistical methodology or software one chooses, as if the data had no missing values to begin with.

Multiple imputation generates imputed values based on a regression model. This regression model will help us generate reasonable values, particularly if data are MAR, since it uses information in the dataset that may be informative in predicting what the true value may be. Ideally, we want predictor variables that are correlated with the missing variable, and with the mechanism of missingness, if any. For example, one might be able to use test scores from one subject to predict missing test scores from another; or, the probability that income is missing may vary systematically with the education level of the individual.

To see if there is any potential information among the variables in our dataset to use for imputation, it is helpful to calculate the pairwise correlation between all the variables. Since we have discrete variables in our data, the Spearman rank correlation coefficient is appropriate.


In [ ]:
test_scores.dropna().corr(method='spearman')

We will try to impute missing values of the mother's high school education indicator variable, which takes a value of 0 for no high school diploma and 1 for a high school diploma or greater. The appropriate model for predicting a binary variable is logistic regression. We will use the scikit-learn implementation, LogisticRegression.


In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

To keep things simple, we will only use variables that are themselves complete to build the predictive model, hence our subset of predictors will exclude family involvement score (family_inv) and previous disability (prev_disab).


In [ ]:
impute_subset = test_scores.drop(labels=['family_inv','prev_disab','score'], axis=1)

Next, we standardize the predictor variables (zero mean, unit variance) to improve the performance of the regression model.


In [ ]:
y = impute_subset.pop('mother_hs').values
X = StandardScaler().fit_transform(impute_subset.astype(float))

The training and test sets in this case will be the non-missing and missing values, respectively, since we want to use supervised learning to build our predictive model.


In [ ]:
missing = np.isnan(y)

Next, we create a LogisticRegression model, and fit it using the non-missing observations.


In [ ]:
mod = LogisticRegression()
mod.fit(X[~missing], y[~missing])

Conveniently, scikit-learn models have predict methods for generating predictions from the model, using new data. Here, we will pass the predictor values for the subset with mother_hs missing.


In [ ]:
mother_hs_pred = mod.predict(X[missing])
mother_hs_pred

These values can then be inserted in place of the missing values, and an analysis can be performed on the entire dataset.
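
For example, the filled-in column might be constructed as follows (a sketch; mother_hs_filled is a name introduced here purely for illustration):


In [ ]:
mother_hs_filled = test_scores.mother_hs.copy()
mother_hs_filled[missing] = mother_hs_pred
mother_hs_filled.isnull().sum()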

However, this is still just a single imputation for each missing value, and hence glosses over the uncertainty associated with the derivation of the imputes. Multiple imputation proceeds by imputing several values, to generate several complete datasets and performing the same analysis on all of them. With a set of estimates in hand, an average estimate can be obtained that more adequately accounts for the uncertainty, hopefully providing more robust inference than from a single impute.

There are a variety of ways to generate multiple imputations. We will use a set of alternative predictive models in order to do this. We can select different combinations of the available predictor variables to build logistic regression models for mother_hs. We will create a suite of models to obtain a range of predictions. For example:


In [ ]:
impute_subset2 = test_scores.drop(labels=['family_inv','prev_disab','score', 'male', 'non_white'], axis=1)
y2 = impute_subset2.pop('mother_hs').values
X2 = StandardScaler().fit_transform(impute_subset2.astype(float))

In [ ]:
mod2 = LogisticRegression()
mod2.fit(X2[~missing], y2[~missing])
mod2.predict(X2[missing])

Surprisingly few imputations are required to achieve reasonable estimates, with 3 to 10 usually being sufficient. We will use 3.

Here are 3 (arbitrarily-chosen) sets of predictors:


In [ ]:
subsets = [['male', 'age_test', 'early_ident', 'non_severe_hl'], 
 ['siblings', 'non_english', 'early_ident', 'non_severe_hl'],
 ['male', 'siblings', 'age_test', 'non_english']]

In [ ]:
mother_hs_imp = []

for subset in subsets:
    
    impute_subset = test_scores[subset]
    y = test_scores['mother_hs'].values
    X = StandardScaler().fit_transform(impute_subset.astype(float))
    
    mod = LogisticRegression()
    mod.fit(X[~missing], y[~missing])
    imputed = mod.predict(X[missing])
    mother_hs_imp.append(imputed)

In [ ]:
mother_hs_imp

Now we can perform 3 separate analyses, using the method of our choice, each based upon a different set of imputed values.

We will fit a linear regression of test score on the available predictors to each of the three imputed datasets, and collect the resulting coefficient estimates.


In [ ]:
from sklearn import linear_model

coefficients = []

for imputes in mother_hs_imp:
    
    regr = linear_model.LinearRegression()
    
    X = test_scores.drop(labels=['family_inv','prev_disab'], axis=1)
    X.loc[missing, 'mother_hs'] = imputes
    y = X.pop('score')
    regr.fit(X, y)
    coefficients.append(regr.coef_)

In [ ]:
coeff_labels = ['male',
                'siblings',
                'non_english',
                'age_test',
                'non_severe_hl',
                'mother_hs',
                'early_ident',
                'non_white']

coef_df = pd.DataFrame(coefficients, columns=coeff_labels)
coef_df

In [ ]:
coef_df.mean()

In [ ]:
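# For comparison: complete case analysis (drop all records with any missing value)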
regr_complete = linear_model.LinearRegression()
X_complete = test_scores.drop(labels=['family_inv','prev_disab'], axis=1).dropna()
y_complete = X_complete.pop('score')
regr_complete.fit(X_complete, y_complete)
pd.Series(regr_complete.coef_, index=coeff_labels)

In [ ]:
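# For comparison: single imputation of missing values with the column means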
regr_mean = linear_model.LinearRegression()
X_mean = test_scores.drop(labels=['family_inv','prev_disab'], axis=1)
X_mean = X_mean.fillna(X_mean.mean())
y_mean = X_mean.pop('score')
regr_mean.fit(X_mean, y_mean)
pd.Series(regr_mean.coef_, index=coeff_labels)